Issues in Exploiting GermaNet as a Resource in Real Applications
Abstract

This paper reports on experiments with GermaNet as a resource for
domain-specific document analysis. The main question is: how good
is GermaNet's coverage in a specific domain? We report the results
of a field test of GermaNet on the analysis of autopsy protocols
and sketch the integration of GermaNet into XDOC.¹ Our remarks
contribute to a GermaNet user's wish list.

1 Introduction 

GermaNet, a lexical-semantic net, was developed in the context of
the LSD project "Ressourcen und Methoden zur semantisch-lexikalischen
Disambiguierung" (Hinrichs et al., 1998). This paper describes an
experiment on the integration of GermaNet into the Document Suite
XDOC. XDOC was designed and implemented as a workbench for the
flexible processing of electronically available documents in German
(Rösner and Kunze, 2002).

We currently are experimenting with XDOC in 
a number of application scenarios. These include: 

• Knowledge acquisition from technical docu- 
mentation about casting technology as sup- 
port for domain experts for the creation of a 
domain specific knowledge base. 

¹ XDOC stands for XML-based document processing.



• Extraction of company profiles from WWW 
pages for an effective search for products and 
possible suppliers. 

• Information extraction from English Medline
abstracts.

• Analysis of autopsy protocols for e.g. statis- 
tical investigation of typical injuries in traffic 
accidents. 

The end users of our applications are domain 
experts (e.g. medical doctors, engineers, ...). They 
are interested in getting their problems solved but 
they are typically neither interested nor trained in 
computational linguistics. Therefore the barrier 
to overcome before they can use a computation- 
al linguistics or text technology system should be 
as low as possible. 

Many of our tools have an extensive need for 
linguistic resources. Therefore we are interest- 
ed in ways to exploit existing resources with 
a minimum of extra work. The resources of 
GermaNet promise to be helpful for different 
tasks in our workbench. GermaNet, a German
version of the Princeton WordNet (Miller, 1990;
Fellbaum, 1998), is based on the same design
principles and database structures as WordNet.
GermaNet is intended to cover the basic vocabulary
of the German language, based on lemmatized
frequency lists from text corpora (see (Hamp and
Feldweg, 1997) or (Kunze, 2001)).

The following scenarios for the integration of
the GermaNet resource into our work are possible
(see also Section 5):



• GermaNet as resource for semantic analyses, 

• GermaNet for a shallow recognition of im- 
plicit document structures, 

• GermaNet for compound analysis. 

This paper is organized as follows. First we give a short
description of the document class 'autopsy protocols', because the
examples in this paper are based on this corpus. We then describe
our results on the coverage of GermaNet for our corpus and how the
ambiguities in these results can be resolved. Next we briefly
sketch the semantic module of XDOC into which the GermaNet
resources are to be integrated. This is followed by the
presentation and discussion of results from our experiments.
Finally, our remarks contribute to a GermaNet user's wish list.

Characteristics of the Document Class: 
Autopsy Protocols 

Autopsy protocols are especially amenable to pro- 
cessing with techniques from computational lin- 
guistics and knowledge representation: 

• Forensic autopsy protocols are in most cas- 
es written with the clear constraint that they 
will be used for legal purposes and will have 
to be interpretable by lawyers and other non- 
medical experts. 

• Autopsy protocols are highly structured and 
follow a strict ordering. 

• The sub-language of the Findings section has a telegraphic style
with a preference for 'verbless' structures. The sub-language of
the other subdocuments is slightly more complex, but still limited
by the communicative requirements (e.g. precision, unambiguous
expression, understandability for non-experts).

2 GermaNet and Autopsy Protocols 

In the following we report on ongoing experiments with a corpus of
currently approx. 600 autopsy protocols from Magdeburg. The corpus
will soon be extended with protocols from other institutes for
forensic medicine from all parts of Germany and shall in the long
run be representative of autopsy protocols from all German-speaking
countries.

Table 1: Coverage for Different Document Parts.

document part   word types   matches   coverage (%)
Findings             17520      2591          14,78
Background            8124      2274          27,99
Discussion            8562      1862          21,74

The central question for our experiments is: given a corpus of
texts from a uniform domain, how good is GermaNet's coverage as
such, i.e. without any investment in extending the available
GermaNet resources? We did not attempt lexical analysis of the
tokens derived from our test corpus, apart from comparing them with
exhaustive lists of closed-class German words (function words,
connectors, prepositions, etc.). This reflects the situation in
which a corpus from a new domain is processed for the first time
and many domain terms are new and not covered by lexical resources.

2.1 Coverage of GermaNet 

Our first experiments measure the coverage of GermaNet for autopsy
protocols. Table 1 shows the coverage rates for the central
document parts of an autopsy protocol. For this evaluation we use
tokens restricted by the following parameters:

• The candidates are not function words, like 
conjunctions, prepositions, etc. Only words 
that are potential candidates for nouns, adjec- 
tives and verbs are tested. 

• In autopsy protocols some tokens are 'im- 
plicit markup', e.g. enumerations of titles or 
paragraphs like 'II.' in 'II. Innere Besichti- 
gung'. These tokens were excluded from the 
test. 

• The length of a potential candidate was re- 
stricted to greater than three characters. 

With these restrictions we reduce the number of 
different tokens to be evaluated in section Findings 
from 18492 to 17520, in section Background from 
8901 to 8124 and in section Discussion from 9198 
to 8562 tokens. 
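
The three restrictions above can be sketched as a simple filter. This is only an illustration: the function-word list is a hypothetical, heavily abbreviated stand-in for the exhaustive closed-class lists used in the experiments, and the enumeration pattern is an assumption about what 'implicit markup' looks like.

```python
import re

# Hypothetical, abbreviated closed-class list (assumption, not the real list).
FUNCTION_WORDS = {"und", "oder", "der", "die", "das", "des", "mit", "aus", "dem"}

# Assumed shape of enumeration markers such as 'II.', '3.' or 'a)'.
ENUMERATION = re.compile(r"^(?:[IVXLC]+\.|\d+\.|[a-z]\))$")


def is_candidate(token: str) -> bool:
    """Return True if the token should be looked up in the lexical resource."""
    if ENUMERATION.match(token):
        return False                      # implicit markup
    if token.lower() in FUNCTION_WORDS:
        return False                      # closed word class
    return len(token) > 3                 # longer than three characters


def candidate_types(tokens):
    """Collect the distinct tokens (word types) that pass all filters."""
    return {t for t in tokens if is_candidate(t)}


tokens = ["II.", "Innere", "Besichtigung", "der", "Leber", "und", "Arm"]
print(sorted(candidate_types(tokens)))
```

Note that the length restriction also removes short content words such as 'Arm', which is a deliberate trade-off of this kind of filter.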



Table 2: Coverage for Different Word Classes.

document part   nouns   verbs   adjectives
Findings         1573     351          806
Background       1622     328          465
Discussion       1162     322          483

GermaNet's coverage is lowest in the Findings section. This is not
surprising, because this section contains many domain-specific
terms (e.g. 'Thalamus', 'submandibularis', 'Hirnkontusionen' or
'Injektion'). In addition, the medical doctors use their own
(subjective) vocabulary for the description of injuries or other
findings, like 'weichkäseartig', 'metallstecknadelkopfgroße' or
'teerstuhlartiger'. The best coverage is achieved in the Background
section. This document part describes the case history (e.g.
details of a traffic accident); it contains many words from common
language and only rarely domain-specific terms. The Discussion
section, which combines the results of the Findings section and the
facts from the Background section, ranks in the middle with a
coverage of 21,74 %.

Table 2 breaks the coverage rates down by word class. In these data
all hits are counted, regardless of whether a GermaNet entry exists
for one or more word classes; therefore the sum of a row is greater
than the number of matches in Table 1.

The word class adverb is ignored in the coverage summary, because
at the time of writing the available version of GermaNet contains
only two synsets for adverbs, and we got zero matches in our corpus
for these adverbs.

With respect to word class the results are uniform across
subdocuments: the largest coverage figure is for nouns, followed by
adjectives and verbs. The high number of adjective matches in the
Findings section is due to the frequent use of adjectives in this
section.

2.2 Characteristics of Uncovered Terms

The tokens that had no entry in GermaNet can be divided into two
classes. Besides uncovered lexical terms (like 'Rotor' or 'Klinge')
there are many specific terms that GermaNet could not cover. An
analysis of these uncovered specific terms, which negatively affect
the results above, gives the following classification.

measured values and ranges: '2cm', '4-9', '120ml',

named entities: 'Beck', 'Otto-von-Guericke-Universität', 'Opel',
'Salvator-Krankenhaus', 'B269', 'Zehringen-Sibbendorf',

truncations: '-aussenseite', '-wischspuren',

compounds: 'Plastikdreipunktsicherheitsschluessel',
'Oberschenkelspiralmehrfragmentfraktur',
'weisslich-gelblich-roetlich-fleckige',

inflected words: 'Armes', 'besitzt', 'entnommen',

misspellings: 'Herzmnuskulatur', 'Herrrren-T-Shirt',
'Todeseinritt'.

The first category consists of non-lexical tokens. Their form and
frequency vary with the domain and text type. They cannot be
expected to be covered by GermaNet and are best treated with
special recognizers (e.g. regular expressions).

All items of the first three categories can be preselected by
different preprocessing steps, like regular expressions or methods
for named entity recognition. The categories misspellings and
inflected words can only be preprocessed successfully (in terms of
GermaNet) by a complex morphological component that recognizes
inflected words and orthographically similar words. For the
processing of compounds it is possible to use the resources of
GermaNet itself (see Section 5).
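
A minimal sketch of such special recognizers, assuming two hand-written regular expressions for measured values and truncations. The patterns and category labels are illustrative assumptions; named entities, compounds, inflected words and misspellings need the richer components described above.

```python
import re

# Assumed pattern: a number, an optional range part, an optional unit.
MEASURE = re.compile(r"^\d+(?:[.,]\d+)?(?:-\d+(?:[.,]\d+)?)?\s?[a-zA-Z%]*$")

# Assumed pattern: a word fragment with a leading or trailing hyphen.
TRUNCATION = re.compile(r"^-\w+$|^\w+-$")


def preclassify(token: str) -> str:
    """Assign a token to a non-lexical category where a pattern applies."""
    if MEASURE.match(token):
        return "measured value or range"
    if TRUNCATION.match(token):
        return "truncation"
    return "needs lexical analysis"


for tok in ["2cm", "4-9", "120ml", "-aussenseite", "Herzens"]:
    print(tok, "->", preclassify(tok))
```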

3 Resolving Ambiguities 

In this section we discuss approaches for resolving 
ambiguities. The discussion is related to the kind 
of ambiguity. In our use of GermaNet we found 
three types of ambiguities. Type one is an ambi- 
guity on the POS level - whether the token to be 
analysed is for example a noun or a verb. The sec- 
ond type occurs when more than one sense exists 
for a word class. The last type is a combination of 
the first two types. 

3.1 Part-of-Speech Ambiguity 

Table 3 shows the ratios of entries with Part-of-Speech ambiguity.²

² In short: POS ambiguity.



Table 3: POS Ambiguity.

                         Findings           Background         Discussion
different word classes   139 (5,36)         135 (5,93)         104 (5,58)
N and V                  72 (2,77; 51,79)   89 (3,9; 65,9)     71 (3,8; 68,2)
N and ADJ                64 (2,47; 46,04)   35 (1,15; 25,92)   31 (1,66; 29,8)
V and ADJ                3 (0,11; 2,15)     5 (0,21; 3,7)      1 (0,05; 0,96)
N, V and ADJ             0 (0; 0)           6 (0,26; 4,44)     1 (0,05; 0,96)

Table 4: Sense Ambiguities.

document part   count   percentage
Findings         1034        39,95
Background        914        40,26
Discussion        823        44,27

The first row of Table 3 gives the number of matches with more than
one word class per literal; the percentage relative to all matches
is given in parentheses.

Rows 2 to 5 present the number of matches in which a specific
combination of word classes, e.g. noun and verb, occurs. The first
value in parentheses is the percentage relative to all matches, and
the second value is the percentage relative to all matches with POS
ambiguity.

In all three document parts the most frequent POS ambiguity occurs
between nouns and verbs. For example, the token 'Herzens' in the
phrase 'Gewicht des Herzens' is interpreted by GermaNet both as a
noun and as a verb.

Given the verbless style of the Findings section, it is not
surprising that only this section shows a similarly high ratio for
the combination 'noun and adjective'.

A simple check of the capitalisation of a token can probably
decrease the rate of POS ambiguity. If sentence-initial positions
are taken into account, a simple upper-/lowercase distinction could
decrease the rate of 'noun-verb' or 'noun-adjective' matches.

Another approach is based on POS information about the tokens to be
analysed (using e.g. MORPHIX (Finkler and Neumann, 1988)). With
this additional POS information we can directly decide which
information we want to retrieve from GermaNet. In addition, we can
use a simple heuristic based on the document section: in the
Findings section, adjective readings can be preferred over verb
readings.
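
The capitalisation check proposed in Section 3.1 can be sketched as follows. This is a hedged illustration, not the XDOC implementation: the function name and the word class labels "N", "V", "ADJ" are assumptions, and the set of readings is assumed to come from a prior lexicon lookup.

```python
def filter_pos_readings(token, readings, sentence_initial=False):
    """Drop POS readings that contradict German noun capitalisation.

    readings: set of word classes returned for the token, e.g. {"N", "V"}.
    """
    if sentence_initial:
        return set(readings)            # capitalisation is uninformative here
    if token[:1].isupper():
        preferred = set(readings) & {"N"}   # capitalised: keep the noun reading
    else:
        preferred = set(readings) - {"N"}   # lowercase: drop the noun reading
    return preferred or set(readings)   # never discard all readings


print(filter_pos_readings("Herzens", {"N", "V"}))
print(filter_pos_readings("Herzens", {"N", "V"}, sentence_initial=True))
```

The fallback in the last line of the function keeps the original readings whenever the heuristic would leave nothing, so that a lexicon gap does not turn into a lost token.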



3.2 Sense Ambiguity 

The average number of senses for a token of our corpus covered by
GermaNet is approx. 1,76. The highest number is found for verbs,
with about three senses on average (verbs: 3,18; nouns: 1,49;
adjectives: 1,62). GermaNet thus returns more than one sense for an
entry in many cases. Table 4 shows the number of tokens with more
than one sense for the different document parts.

One method for resolving the senses is the use of contextual
information. The specific structure of our documents (division into
three main parts) and the content-related separation into these
parts allow us to exploit this information for determining the most
likely sense. As a start we use the semantic fields of GermaNet.
Experiments show clear differences between the parts in terms of
the most frequent fields (see Table 5).

Although the subdocuments may differ slightly 
in this respect there is a strong preference for med- 
ical readings (senses) for potentially ambiguous 
words in the corpus of autopsy protocols. This is 
especially true for the subdocument with informa- 
tion about the examination findings. The subdoc- 
ument with the background is the place where the 
expectation for medical senses seems to be weak- 
est. 

Please note that words may have even conflict- 
ing 'medical readings'. 'Blase' may be an organ 
(bladder) or an injury (e.g., caused by fire). 

Table 5: Typical Semantic Fields of the Document Parts.

section      most frequent semantic fields
Findings     nomen.Körper, verb.Lokation, verb.Veränderung,
             adj.Körper, adj.Perzeption
Background   nomen.Geschehen, adj.Zeit, adj.Lokation
Discussion   nomen.Geschehen, nomen.Körper, verb.Lokation,
             verb.Veränderung, adj.Relation

In the Findings section the most frequent GermaNet categories are
'nomen.Körper', 'verb.Veränderung' and 'verb.Lokation'. For
resolving ambiguities we use this information (majorities) to
preselect senses depending on the current document section. For
example, GermaNet classifies the noun 'Becken' into the semantic
fields 'nomen.Artefakt' (in the sense of 'musical instrument') and
'nomen.Körper' (in the sense of 'bone'). In the analysis of the
Findings section we prefer the sense 'nomen.Körper'. In the
Background section the sense 'nomen.Artefakt' has a higher
likelihood than the sense 'nomen.Körper'.
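
The section-dependent preselection of senses can be sketched as a lookup against the typical semantic fields of each document part. The data structure and function name are assumptions for illustration; the field lists are those reported for the three document parts, and the glosses for 'Becken' are taken from the example above.

```python
# Typical semantic fields per document part (as reported in the paper).
PREFERRED_FIELDS = {
    "Findings": {"nomen.Körper", "verb.Lokation", "verb.Veränderung",
                 "adj.Körper", "adj.Perzeption"},
    "Background": {"nomen.Geschehen", "adj.Zeit", "adj.Lokation"},
    "Discussion": {"nomen.Geschehen", "nomen.Körper", "verb.Lokation",
                   "verb.Veränderung", "adj.Relation"},
}


def preselect_senses(senses, section):
    """Keep the senses whose semantic field is typical for the section.

    senses: list of (semantic_field, gloss) pairs for one token.
    Falls back to all senses if no field is typical for the section.
    """
    preferred = PREFERRED_FIELDS.get(section, set())
    hits = [s for s in senses if s[0] in preferred]
    return hits or list(senses)


becken = [("nomen.Artefakt", "musical instrument"), ("nomen.Körper", "bone")]
print(preselect_senses(becken, "Findings"))
print(preselect_senses(becken, "Background"))
```

In this sketch the Background section keeps both senses, because neither field is among its most frequent ones; preferring 'nomen.Artefakt' there would need an additional rule, for example dispreferring fields that are typical only of other sections.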

3.3 Combined Ambiguity 

These cases are very rare in the corpus: Findings: 11 (0,42 %),
Background: 19 (0,83 %) and Discussion: 15 (0,8 %). They could
probably be resolved through the approaches outlined in Section 3.1
and Section 3.2.

4 GermaNet inside the Semantic Module 
of XDOC 

The GermaNet resources are integrated for the purposes of semantic
analysis. In this section we outline the strategies for semantic
analysis within XDOC. The Semantic Module in XDOC exploits three
analysis techniques for the annotation of documents with semantic
information. The results of the analysis are recorded in separate
Topic Maps or annotated within the documents in a specific XML
format. We first give a short description of the semantic analyses
inside XDOC.

Semantic Tagger. The Semantic Tagger classifies content words into
their semantic categories (different applications may organize
those categories differently, in the form of taxonomies or
ontologies). As input we expect a text tagged with POS tags, to
which we then apply a semantic lexicon. This lexicon contains the
semantic interpretation of a token and a case frame combined with
the syntactic valence requirements. Similar to POS tagging, the
tokens in the input are annotated with their meanings and with a
classification into semantic categories (i.e. specific concepts or
relations). The classification of a token in isolation may not be
unique. Like a POS tagger, a semantic tagger that processes
isolated tokens is not able to disambiguate between multiple
semantic categorisations. This task is postponed to contextual
processing within case frame analysis (Semantic Parser).

Semantic Parser. The Semantic Parser is one method in XDOC for the
assignment of semantic relations between isolated (but related)
tokens. By case frame analysis of a token we obtain details about
the type of the recognized concepts (resolving multiple
interpretations) and possible relations to other concepts. Fig. 1
contains the results of the analysis of the noun phrase
'Unfallablauf mit Herausschleudern der Koerper aus dem PKW'. Here
we obtain the relation part between 'Unfallablauf' and
'Herausschleudern der Koerper aus dem PKW', and the relations
location (between 'Herausschleudern' and 'PKW') and patient
(between 'Herausschleudern' and 'Koerper').




Figure 1: Results of Case Frame Analysis. 

Semantic Interpretation of Syntactic Structure (SIsS). Another step
in the recognition of relations between tokens is the semantic
interpretation of the syntactic structure of a phrase or sentence.
We exploit the syntactic structure of the language (e.g. structures
of noun phrases) and the semantic interpretation of the tokens
inside the structure to extract relations between several tokens.
Fig. 2 visualizes the results of the analysis of the noun phrase
'dunkelrote Unterblutung der Schleimhaut der Niere'. The analysis
of this complex noun phrase results in three relations between the
separate nouns. The relation prop is used to label properties of a
concept. Future work here: the generic relation gen-attribute
(short for attribute based on a genitive surface case) has to be
resolved into the appropriate more specific relations, like part-of
or patient.




Figure 2: Results of SIsS.

The core of all these semantic analysis techniques is a semantic
lexicon. This lexicon records the meanings and case frames (only
for nouns and verbs) of a word.

Up to now the entries of this lexicon have been built manually and
are partially domain dependent. Now we want to integrate the
GermaNet resources into our framework.

4.1 Integration of GermaNet 

Currently the integration of GermaNet is realised 
in the semantic tagger. For the semantic lexicon 
we use the conceptual relation hypernym of Ger- 
maNet. The tagger uses the first level of the hy- 
pernym relation for the annotation of tokens with 
information about the GermaNet senses: 

(tag-semantic-xml "<N>Leber</N> <S-KONJ>und</S-KONJ> <N>Niere</N>")

"<CONCEPT TYPE="Innerei; Verdauungsorgan">Leber</CONCEPT>
<XXX><S-KONJ>und</S-KONJ></XXX>
<CONCEPT TYPE="Innerei; Harnorgan">Niere</CONCEPT>"



The XML-attribute TYPE contains the hyper- 
nym information from GermaNet. The different 
senses are separated by a semicolon. 
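
The annotation step can be illustrated with a small Python sketch. It is not the XDOC tagger (which is called as `tag-semantic-xml` above); the function `tag_semantic`, the toy `HYPERNYMS` table and the tuple-based input are assumptions made for illustration, with the hypernyms copied from the example output.

```python
from xml.sax.saxutils import escape, quoteattr

# Toy first-level hypernyms per (word, POS); a stand-in for GermaNet lookups.
HYPERNYMS = {
    ("Leber", "N"): ["Innerei", "Verdauungsorgan"],
    ("Niere", "N"): ["Innerei", "Harnorgan"],
}


def tag_semantic(tokens):
    """Annotate POS-tagged tokens with hypernym information.

    tokens: list of (pos_tag, word) pairs.
    Covered tokens become CONCEPT elements whose TYPE attribute lists the
    senses separated by semicolons; all other tokens are wrapped in XXX.
    """
    out = []
    for pos, word in tokens:
        senses = HYPERNYMS.get((word, pos))
        if senses:
            type_attr = quoteattr("; ".join(senses))
            out.append(f"<CONCEPT TYPE={type_attr}>{escape(word)}</CONCEPT>")
        else:
            out.append(f"<XXX><{pos}>{escape(word)}</{pos}></XXX>")
    return " ".join(out)


print(tag_semantic([("N", "Leber"), ("S-KONJ", "und"), ("N", "Niere")]))
```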

For better results we reconfigured our semantic tagger. In contrast
to the earlier version, the semantic tagger now expects tokens that
carry not only POS information (word classes) but also the stem of
each token. In this way we can ask for the senses of a given word
class using the non-inflected word form.

(tag-semantic-xml "<N STEM="Gewicht">Gewicht</N>
<DETD>des</DETD> <N STEM="Herz">Herzens</N>")

"<CONCEPT TYPE="?physikalisches Attribut; Wichtigkeit
Messgeraet, Messgeraet*o, Messinstrument*o,
Messinstrument; Artefakt, Werk">Gewicht</CONCEPT>
<XXX><DETD>des</DETD></XXX> <CONCEPT TYPE="Innerei;
Organ; Farbe, Spielfarbe; Flaeche, Ebene">Herzens
</CONCEPT>"

A further integration of the GermaNet resources is possible in the
Semantic Parser, where we could use the information in GermaNet's
verb frames. At present, however, mapping GermaNet verb frames to
XDOC case frames is problematic. In addition to syntactic valency
(e.g. noun phrase in the accusative), our case frames describe the
potential semantic roles of the frame's fillers. This information
is not available from GermaNet's verb frames. For this integration
it is therefore necessary to add the semantic information manually
or by a corpus-based approach (learning from corpora). For
instance, for the analysis of the sentence 'Sie wurde am Kopf
operiert.' we get the following GermaNet sense for the verb
'operieren':

Sense 1 operieren
-> medizinisch behandeln
-> wandeln, ändern, mutieren, verändern

For this sense GermaNet contains the following verb
frames:

Sense 1 
operieren 

*> NN.AN 
*> NN.AN.BL 

The second verb frame matches our example 
sentence. 

The use of GermaNet's verb frames in the analysis of the sentence
'Sie wurde im KKH xxx am Arm operiert.' is problematic, however,
because the BL complement could be assigned either to the locative
prepositional phrase 'im KKH xxx' or to the locative prepositional
phrase 'am Arm'. One of the two prepositional complements gets no
direct assignment to a complement defined by GermaNet's verb
frames. Other similarly problematic examples from our corpus are:

• Nach polizeilichen Angaben aus der Akte und den 
klinischen Unterlagen wurde G xxx/xx am Morgen 
des dd.mm.jj im Krankenhaus X wegen einer knotigen 
Kropfbildung operiert (Strumaresektion). 

• Am dd.mm.jj wurde G xxx/xx im KKH xxx am Herzen 
operiert. 

A more detailed description, e.g. additional information about the
semantic role of the complement's content, would be helpful for the
analysis. Our Semantic Parser works with such information. To use
the verb frames in the analysis with our Semantic Parser we need
additional features for the Adverbial Complement (BL) of the verb
frame:³

• semantic role of the filler: body part, for example organs or
extremities,

• possible preposition: am,

• case of PP: dative or not specified.⁴

Other features to be considered when using verb frames are:

• the different complement forms for active or passive usage of a
verb and

• the number of a noun phrase: for example, a possible verb frame
for the verb 'kollidieren' is 'NN.Pp', in which the prepositional
phrase is defined as an optional complement. A necessary additional
feature for the noun phrase is its number (singular or plural). For
example, the subject noun phrase in sentences like 'Die Fahrzeuge
kollidierten.' must name more than one participant of the accident.
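
As an illustration of how the additional features for the BL complement could resolve the attachment problem in 'Sie wurde im KKH xxx am Arm operiert.', consider the following sketch. All class names, field names and semantic class labels are hypothetical; neither GermaNet nor XDOC is claimed to use this representation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BLFeatures:
    """Assumed enrichment of the BL complement of a verb frame."""
    filler_role: str                   # semantic class required of the filler
    prepositions: Tuple[str, ...]      # admissible prepositions
    case: Optional[str] = None         # None means 'not specified'


@dataclass(frozen=True)
class PP:
    """A prepositional phrase with a (hypothetical) semantic class label."""
    preposition: str
    noun: str
    semantic_class: str


# Enriched BL complement for 'operieren', following the feature list above.
OPERIEREN_BL = BLFeatures(filler_role="body part", prepositions=("am",),
                          case="dative")


def select_bl_fillers(pps: List[PP], features: BLFeatures) -> List[PP]:
    """Return the PPs that satisfy the enriched BL complement."""
    return [pp for pp in pps
            if pp.preposition in features.prepositions
            and pp.semantic_class == features.filler_role]


pps = [PP("im", "KKH", "institution"), PP("am", "Arm", "body part")]
print(select_bl_fillers(pps, OPERIEREN_BL))
```

With these features only 'am Arm' qualifies as the BL complement, so 'im KKH xxx' can be treated as a free locative adjunct.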

To complete GermaNet's verb frames, this additional information can
be added either manually or by analysing occurrences of similar
phrases in the corpus. With the corpus-based approach the user gets
a list of possible complements for a verb, so that the verb frame
of GermaNet can be enriched with corpus- and context-related
features. GermaNet's verb frames are used as patterns for the
search inside the corpus. The basis for this approach is a corpus
with syntactic structures annotated by the Syntactic Parser of XDOC
(Rösner, 2000).

³ When we assume that BL is a prepositional phrase.
⁴ When no unique assignment to one case is possible.

One problem in the SIsS analysis is the correct interpretation of
the genitive relation. One solution is to use GermaNet's conceptual
relations meronym and holonym. For example, the results of the SIsS
analysis of the phrase 'unauffaelliger Vorhof des Herzens' are
shown in Fig. 3.




Figure 3: Result for the Phrase 'unauffaelliger
Vorhof des Herzens'.

The SIsS analysis recognises two relations: first the prop relation
between the tokens 'unauffaellig' and 'Vorhof', and second the
gen-attribute relation. This relation can in general be interpreted
in several ways (e.g. as part-of or patient-of). For the token
'Herz' (in the sense of organ) GermaNet gives the meronyms Vorhof,
Herzklappe, Herzkammer, Herzrohr, linke Herzhälfte, rechte
Herzhälfte, Herzmuskel, Herzkranzgefäß, and for the token 'Vorhof'
the holonym 'Herz'. With the meronym and holonym information from
GermaNet we can decide that the generic relation ('gen-attribute')
between the tokens 'Herz' and 'Vorhof' is a part-of relation.
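
This decision can be sketched with two small lookup tables. The tables are toy stand-ins for GermaNet's meronym and holonym relations (abbreviated from the example above), the function name is an assumption, and the tokens are assumed to be reduced to their stems beforehand.

```python
# Toy excerpts of the meronym and holonym relations (assumed data).
MERONYMS = {"Herz": {"Vorhof", "Herzklappe", "Herzkammer", "Herzmuskel"}}
HOLONYMS = {"Vorhof": {"Herz"}}


def resolve_gen_attribute(head, genitive):
    """Refine the generic gen-attribute relation of 'head des genitive'.

    Returns 'part-of' if the head is a known part of the genitive noun,
    otherwise leaves the generic relation unresolved.
    """
    if head in MERONYMS.get(genitive, set()):
        return "part-of"
    if genitive in HOLONYMS.get(head, set()):
        return "part-of"
    return "gen-attribute"


print(resolve_gen_attribute("Vorhof", "Herz"))   # Vorhof des Herzens
print(resolve_gen_attribute("Gewicht", "Herz"))  # Gewicht des Herzens
```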

4.2 Practical Aspects of the Integration of 
GermaNet 

Technical access to GermaNet was realised in two ways: offline and
online usage. In offline usage GermaNet is transformed into an
application-specific resource. This transformation may be carried
out beforehand as a compilation step. Online usage employs the
GermaNet resources via their API.

For the offline usage of GermaNet we transform only the necessary
information into the application-specific resources. Depending on
the task to be performed we need different information from
GermaNet. In one case we need the synsets and hypernyms (Semantic
Tagger); in other cases we work only with information about the
semantic fields of a token (for example in the scenario 'shallow
recognition of document sections'). The relations inside GermaNet
can also be described by path directions: up, down, horizontal (see
(Hirst and St-Onge, 1998)). Within the semantic module of XDOC the
following 'path directions' could be useful:

Semantic Tagging: search for a sense that is admissible in the
context (hypernyms, synsets),

Semantic Parser: assignment to semantic roles, search for a filler
of a semantic role (hypernyms), syntactic information via verb
frames,

SIsS: resolution of the semantic interpretation of genitive
structures (e.g. Schleimhaut des Magens) by 'meron' or 'holon'
information.

Concerning the coverage of GermaNet we ob- 
tained the following results: 

• Some tokens of our corpus are not covered by GermaNet, especially
in the open word classes, like adjectives (e.g. 'quer'), or among
domain-specific words (e.g. 'Fraktur').

• Incomplete senses of an entry. For instance, for the word
'Abfall' there exists only one sense, related to the semantic field
'noun.substance', but in our corpus we often find the word 'Abfall'
in phrases like 'Abfall des Blutdruckes'. Please note: for the verb
'abfallen', from which the noun 'Abfall' is derived by verb-noun
conversion, we also do not get the sense relevant to our domain:

1 sense of abfallen
Sense 1 abfallen
-> lösen
=> ?Dauerkontakt

From a more technical perspective the following points are
relevant:

• GermaNet has only a very simple morphological component, or none
at all (WordNet does better here); e.g. 'Autos' will be found in
GermaNet, but 'Organe' will not. This can only be explained by the
use of an English morphology component (from WordNet): GermaNet
applies English inflection rules in the analysis of the input data.
By reconfiguring our Semantic Tagger we can avoid this effect (see
Section 4.1).

• Use of umlauts in GermaNet: the documents in our corpus are
written without umlauts, but GermaNet only supports access via
spellings with umlauts. Matching candidates without umlauts to
possible candidates with umlauts in GermaNet would be helpful and
would lead to better coverage.

To address the last two points we added two intermediate steps to
our experiment environment. First we integrated the morphological
component MORPHIX (revised results for the different sections:
Background: 41,39 %, Findings: 29,64 %, Discussion: 40,57 %). The
second step was the treatment of umlauts, which again improved our
GermaNet coverage results (Background: 43,38 %, Findings: 31,02 %,
Discussion: 42,38 %).
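
One way to treat umlauts is to generate all spellings in which an expanded digraph ('ae', 'oe', 'ue') is optionally contracted to an umlaut, and to keep only those variants that exist in the lexicon. The following sketch makes that idea concrete; it is an assumption about how such a step could look, not a description of our actual implementation, and it ignores the 'ss'/'ß' alternation.

```python
import re
from itertools import product

EXPANSIONS = {"ae": "ä", "oe": "ö", "ue": "ü", "Ae": "Ä", "Oe": "Ö", "Ue": "Ü"}
DIGRAPH = re.compile(r"ae|oe|ue|Ae|Oe|Ue")


def umlaut_variants(word):
    """Return all spellings obtained by optionally contracting each digraph."""
    parts = DIGRAPH.split(word)        # text between the digraphs
    found = DIGRAPH.findall(word)      # the digraphs themselves, in order
    variants = set()
    for choice in product((False, True), repeat=len(found)):
        pieces = [parts[0]]
        for contract, digraph, rest in zip(choice, found, parts[1:]):
            pieces.append(EXPANSIONS[digraph] if contract else digraph)
            pieces.append(rest)
        variants.add("".join(pieces))
    return variants


def lookup(word, lexicon):
    """Return the variants of word that are entries of the lexicon."""
    return sorted(v for v in umlaut_variants(word) if v in lexicon)


lexicon = {"Gebäude", "Bauer", "Körper"}
print(lookup("Gebaeude", lexicon))
print(lookup("Bauer", lexicon))
```

Checking every variant against the lexicon avoids wrongly contracting genuine letter sequences, as in 'Bauer', where 'ue' is not an expanded umlaut.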

4.3 User's Wish List 

Some items for the GermaNet user's wish list:

• It seems that in the case of orthographic variants GermaNet
sometimes 'knows' more than it makes available. An example:
GermaNet has the information that '4-eckig' is an orthographic
variant of 'viereckig', but only returns information when the user
(or application program) asks with the (canonical) spelling
'viereckig'.

• Flexible matching of umlauts and expanded spellings: given that
in computer-written text umlauts are still often represented in the
expanded form 'ae', 'oe', ..., it would be helpful to make
GermaNet's lexicon access more flexible, so that search terms in
the expanded spelling match existing entries with umlauts (i.e.
'Gebaeude' should match 'Gebäude').

• Avoid artefacts due to English spelling rules from WordNet:
WordNet and GermaNet offer convenience functions for searching the
resources, in the sense that some but not all inflections,
derivations, and alternative spellings can be handled. For example:
'Herzens' matches the verb 'herzen'(!) but not the noun 'Herz'.⁵

• Finally: GermaNet is not error-free. In our work we occasionally
get messages like 'Error Cycle detected' or 'Synset xxxx not
found', which leave the user uncertain about the results returned
by GermaNet.

5 Discussion: Back to the Scenarios 

In previous sections we have described some inte- 
gration aspects of GermaNet for different scenar- 
ios. Now we give a concrete outline of the scenar- 
ios. 

GermaNet as Resource for Semantic Analyses.

In Section 4 we described the integration of GermaNet as a resource
in the Semantic Module of XDOC. There we use the lexical-semantic
net for the annotation of tokens with their semantic roles
(Semantic Tagger). For this task we exploit the different relations
defined inside GermaNet (e.g. hypernym or synonym). For the tasks
of the Semantic Parser and the SIsS analysis we additionally use
the information in verb frames and other conceptual relations, like
the 'meron' and the 'holon' relation. The Semantic Parser uses this
information directly in the analysis, while the SIsS analysis uses
GermaNet's information in a postprocessing step to select one
(possible) interpretation from the different readings resulting
from the SIsS analysis (e.g. for the relation gen-attribute).

⁵ Please note: 'Herzens' can be erroneously derived from the verb
'herzen' under the assumption of English inflection: the 'English'
morphological attributes of 'Herzens' are then third person
singular.



GermaNet for a Shallow Recognition of Implicit Document Structures.
In Section 2 we have given a short specification of autopsy
protocols. The characteristics of the different document parts can
be used to recognize these parts. The following parameters describe
the different document parts (also in relation to the information
available from GermaNet):

Findings: high ratio of nouns and adjectives; short specific
syntactic (sentence) structures; semantic fields like
'nomen.Körper', 'adj.Körper', 'verb.Veränderung',

Background: standard distribution of all word classes; regular
syntactic structures; semantic fields like 'nomen.Geschehen',
'adj.Zeit', 'adj.Lokation',

Discussion: standard distribution of all word classes; regular
syntactic structures; semantic fields like 'nomen.Geschehen',
'nomen.Körper', 'verb.Lokation', 'verb.Veränderung'.

The distribution of the semantic fields over the different document
parts can be used for the recognition of these document parts. For
example, a document part with frequent occurrences of tokens that
can be assigned to semantic fields like 'nomen.Geschehen',
'adj.Zeit', 'adj.Lokation', and no occurrences of tokens assigned
to 'nomen.Körper' etc., can be identified as the Background section
of an autopsy protocol. For a unique identification we also use the
word class information from the POS Tagger and the information
about the kind of syntactic structures from the Syntactic Parser to
confirm the other characteristic criteria of a document part.
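
A minimal sketch of recognizing a document part from its semantic field distribution follows. The scoring function is an assumption chosen for illustration (share of tokens that fall into a part's typical fields, weighted by the share of those fields that actually occur); it uses only the semantic fields named above and ignores the word class and syntactic criteria.

```python
from collections import Counter

# Typical semantic fields per document part, as listed above.
SECTION_PROFILES = {
    "Findings": {"nomen.Körper", "adj.Körper", "verb.Veränderung"},
    "Background": {"nomen.Geschehen", "adj.Zeit", "adj.Lokation"},
    "Discussion": {"nomen.Geschehen", "nomen.Körper", "verb.Lokation",
                   "verb.Veränderung"},
}


def profile_score(profile, counts, total):
    """Token share inside the profile times share of profile fields seen."""
    token_share = sum(counts[f] for f in profile) / total
    field_share = sum(1 for f in profile if counts[f] > 0) / len(profile)
    return token_share * field_share


def guess_section(fields):
    """Guess the document part from the semantic fields of its tokens.

    fields: iterable with one semantic field per token occurrence.
    Returns None if no fields are given.
    """
    counts = Counter(fields)
    total = sum(counts.values())
    if total == 0:
        return None
    return max(SECTION_PROFILES,
               key=lambda name: profile_score(SECTION_PROFILES[name],
                                              counts, total))


print(guess_section(["nomen.Geschehen", "adj.Zeit", "adj.Lokation",
                     "nomen.Geschehen"]))
```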

GermaNet for Compound Analysis. In the au- 
topsy protocol corpus - as well as in other medical 
or technical texts - noun compounds are quite fre- 
quent. The question here is: Is it possible to 

• safely determine segmentations of noun com- 
pounds and to 

• construct meaning hypotheses for noun com- 
pounds by combining the meaning of the 
compound's parts if they are covered by Ger- 
maNet? 

Please note: segmentation of German noun compounds (i.e.
determination of the boundaries between the parts of a compound)
may produce artefacts even when the hypothesized compound segments
are lexical entries in their own right.

Examples (suggested segmentations indicated 
with [ ... ]): 

Transport ... * [Tran] [sport] 
Lebertransport ... * [Lebertran] [sport] 
[Leber] [transport] 

We therefore favour an approach to compound 
segmentation that additionally takes the corpus 
and the occurrence frequencies of complex words 
with common pre- and suffixes into account and 
thus reduces the dependence on the lexicon and its 
coverage. 

The corpus-based analysis of compounds with GermaNet can be
described as follows. The first step is to find all compounds with
similar suffixes inside the corpus, like 'Nierentransplantation',
'Lebertransplantation' etc. The next step is to define 'Top Level'
relations between possible candidates for compound parts, in our
example <organs><medical-operation>, to avoid wrong interpretations
of compounds. Here we can use GermaNet's semantic field information
for the description of relations between possible candidates.
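
The segmentation artefacts discussed in this section are easy to reproduce with a purely lexicon-driven splitter. The sketch below enumerates every split of a word into lexicon entries; the toy lexicon and the function name are assumptions, and linking elements (such as the 'n' in 'Nierentransplantation') are not handled.

```python
def segmentations(word, lexicon, min_len=3):
    """Return all ways to split word into two or more lexicon entries."""
    word = word.lower()
    results = []

    def walk(start, parts):
        if start == len(word):
            if len(parts) > 1:          # ignore the unsplit word itself
                results.append(parts)
            return
        for end in range(start + min_len, len(word) + 1):
            segment = word[start:end]
            if segment in lexicon:
                walk(end, parts + [segment])

    walk(0, [])
    return results


# Toy lexicon (assumption); every entry is a legitimate German word.
LEXICON = {"leber", "lebertran", "tran", "sport", "transport"}

for split in segmentations("Lebertransport", LEXICON):
    print(split)
```

The splitter returns the intended reading [leber, transport] alongside the artefacts [lebertran, sport] and [leber, tran, sport], which is why corpus frequencies and 'Top Level' relations between the parts are needed to rank the candidates.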

6 Conclusion 

We have reported on first experiments in integrating GermaNet
resources into XDOC for the processing of autopsy protocols.

Although the coverage of GermaNet in our experiments was not as
high as in Saito's experiments (Saito et al., 2002), the results
for a corpus of autopsy protocols are encouraging. (A parallel
experiment with the EUROPARL corpus, available at
http://www.isi.edu/~koehn, resulted in a lower coverage: of 198546
tested tokens only 30344, i.e. about 15,3 %, are covered by
GermaNet. This is probably due in part to the high ratio of named
entities in the Europarl corpus.) The results could be further
improved by XDOC's preprocessing steps, like named entity
recognition, POS tagging etc., so that an adoption of GermaNet
resources in the semantic analyses of XDOC is conceivable.

We use GermaNet's lexical-semantic net for semantic enrichment of
documents. GermaNet's resources were primarily integrated into the
Semantic Tagger of XDOC. In future work we will further extend the
integration of GermaNet for the SIsS analysis and the Semantic
Parser.